The CAZyme prediction tools (classifiers) dbCAN, CUPP and eCAMI were independently evaluated against a high-quality benchmark test set. Performance was evaluated on CAZyme/non-CAZyme differentiation, multilabel CAZy class classification, and multilabel classification of CAZy family annotations.
Results summary:
- dbCAN and DIAMOND showed the strongest performances in CAZyme/non-CAZyme differentiation
- dbCAN was the strongest performing tool across all categories, Hotpep (a tool invoked by dbCAN) was the weakest
- The performances of CUPP and eCAMI were similar, although CUPP performed marginally better on the multilabel classification of CAZy family annotations
- The performance of dbCAN may be optimised by substituting Hotpep with CUPP and/or eCAMI
The CAZyme classifiers dbCAN (Zhang et al. 2018), CUPP (Barrett and Lange, 2019) and eCAMI (Xu et al. 2019) use different methods to predict whether a protein is a CAZyme or non-CAZyme, and to predict the CAZy family annotations of predicted CAZymes. These classifiers have not previously been independently, comprehensively or reproducibly evaluated against a high-quality benchmark test set.
The Python package pyrewton was used to create the test sets for the evaluation, invoke the CAZyme classifiers, and perform the statistical evaluation of their performances (using the sklearn library).
This notebook lays out the independent, reproducible and comprehensive evaluation of dbCAN, CUPP and eCAMI against a high-quality benchmark test set. The tools were evaluated at three levels of CAZyme classification: CAZyme/non-CAZyme, CAZy class and CAZy family. Specifically, this evaluation covers the performance of:
- Binary CAZyme/non-CAZyme classification
- Multilabel classification of CAZy class annotations
- Binary classification of each CAZy class, independent of all other CAZy classes
- Multilabel classification of CAZy family annotations
- Binary classification of each CAZy family, independent of all other CAZy families
dbCAN incorporates the three protein function classifiers HMMER (Potter et al. 2018), Hotpep (Busk et al. 2017), and DIAMOND (Buchfink et al. 2015). In order to comprehensively evaluate the performance of dbCAN, the predictions from HMMER, Hotpep and DIAMOND were evaluated independently of one another, and the consensus prediction (a prediction that at least two of the tools agree upon) was defined as the dbCAN result.
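A minimal sketch of this consensus rule (the function name is illustrative, not dbCAN's API):

```python
def dbcan_consensus(hmmer: bool, hotpep: bool, diamond: bool) -> bool:
    """Count a protein as a dbCAN CAZyme when at least two of the three
    incorporated tools classify it as a CAZyme."""
    return (hmmer + hotpep + diamond) >= 2

# Example: HMMER and DIAMOND call the protein a CAZyme, Hotpep does not
assert dbcan_consensus(True, False, True)
```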
70 test sets, each containing 100 CAZymes and 100 non-CAZymes, were used in the evaluation. Every CAZyme classifier parsed the same 70 test sets.
Each test set was created from a unique genomic assembly. From each genomic assembly, 100 CAZymes were selected at random, and the 100 non-CAZymes with the highest sequence similarity to the selected CAZymes were included in the test set. Choosing the 100 non-CAZymes with the highest sequence similarity increased the probability of causing confusion. The CAZyme classifiers were therefore evaluated against test sets designed to cause them the greatest confusion, producing a baseline of each classifier's performance and avoiding an overoptimistic evaluation. An equal number of CAZymes and non-CAZymes was selected to prevent over-representation of one population over the other.
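A minimal sketch of this test-set construction, assuming a precomputed pairwise similarity lookup (e.g. from an all-vs-all alignment); the names and data structures are illustrative, not pyrewton's API:

```python
import random

def build_test_set(cazymes, non_cazymes, similarity, n=100, seed=42):
    """Select n CAZymes at random, then the n non-CAZymes most similar to them.

    `similarity[(non_cazyme, cazyme)]` is assumed to be a precomputed pairwise
    sequence-similarity score between two protein identifiers.
    """
    rng = random.Random(seed)
    selected_cazymes = rng.sample(cazymes, n)
    # Rank each non-CAZyme by its best similarity to any selected CAZyme
    ranked = sorted(
        non_cazymes,
        key=lambda nc: max(similarity[(nc, c)] for c in selected_cazymes),
        reverse=True,
    )
    return selected_cazymes, ranked[:n]
```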
For a genomic assembly to be used to create a test set, the assembly had to meet all of the following criteria:
The genomic assemblies were also chosen from a range of taxa, to provide as comprehensive an understanding as possible of the performance of the classifiers across a range of datasets.
Table 2.1 contains the genomic assemblies used to create the test sets for the evaluation. In total 70 assemblies were chosen:
| Phylogeny | Strain | NCBI Taxonomy ID | GenBank Assembly Accession | Number of CAZymes in CAZy |
|---|---|---|---|---|
| Oomycetes | Dictyoglomus turgidum DSM 6724 | 515635 | GCA_000021645.1 | 101 |
| Fungi Ascomycetes | Aspergillus flavus NRRL 3357 | 227321 | GCA_009017415.1 | 441 |
| | Aspergillus chevalieri M1 | 182096 | GCA_016861735.1 | 521 |
| | Metarhizium brunneum 4556 | 500148 | GCA_013426205.1 | 394 |
| | Peltaster fructicola (ascomycetes) LNHT1506 | 403677 | GCA_001592805.2 | 267 |
| | Penicillium digitatum (ascomycetes) PdW03 | 36651 | GCA_016767815.1 | 318 |
| | Micromonas commoda (green algae) RCC299 | 296587 | GCA_000090985.2 | 148 |
| | Yarrowia lipolytica DSM 3286 | 4652 | GCA_014490615.1 | 133 |
| | Botrytis cinerea B05.10 | 332648 | GCA_000143535.4 | 341 |
| | Eremothecium gossypii ATCC 10895 | 284811 | GCA_000091025.4 | 108 |
| | Kluyveromyces lactis CBS 2105 | 28985 | GCA_007993695.1 | 242 |
| | Kluyveromyces lactis NRRL Y-1140 | 284590 | GCA_000002515.1 | 118 |
| | Kluyveromyces marxianus CBS4857 | 4911 | GCA_001854445.2 | 123 |
| | Pyricularia oryzae | 318829 | GCA_004346965.1 | 550 |
| | Sugiyamaella lignohabitans CBS 10342 | 796027 | GCA_001640025.2 | 150 |
| | Fusarium culmorum Class2-1B | 5516 | GCA_016952355.1 | 486 |
| | Fusarium oxysporum Fo47 | 660027 | GCA_013085055.1 | 719 |
| | Fusarium pseudograminearum Class2-1C | 101028 | GCA_016952305.1 | 485 |
| | Cordyceps militaris ATCC 34164 | 73501 | GCA_008080495.1 | 319 |
| | Clavispora lusitaniae P5 | 36911 | GCA_009498115.1 | 135 |
| Yeast | Brettanomyces bruxellensis UCD 2041 | 5007 | GCA_011074885.2 | 140 |
| | Pichia kudriavzevii CBS573 | 4909 | GCA_003054445.1 | 137 |
| | Brettanomyces nanus CBS 1945 | 13502 | GCA_011074865.2 | 140 |
| | Metschnikowia aff. pulcherrima (budding yeasts) APC 1.2 | 2163413 | GCA_004217705.1 | 136 |
| | Zygosaccharomyces parabailii ATCC 60483 | 1365886 | GCA_001984395.2 | 220 |
| | [Candida] glabrata BG2 | 5478 | GCA_014217725.1 | 146 |
| | [Candida] auris B11220 | 498019 | GCA_003013715.2 | 131 |
| | [Candida] auris B11245 | 498019 | GCA_008275145.1 | 131 |
| | Candida dubliniensis CD36 | 573826 | GCA_000026945.1 | 140 |
| Eukaryote | Ostreococcus lucimarinus CCE9901 | 436017 | GCA_000092065.1 | 115 |
| | Chloropicon primus CCMP1205 | 1764295 | GCA_007859695.1 | 221 |
| Gram Positive Bacteria | Hungateiclostridium thermocellum ATCC 27405 | 203119 | GCA_000015865.1 | 144 |
| | Hungateiclostridium clariflavum DSM 19732 | 720554 | GCA_000237085.1 | 147 |
| | Alicyclobacillus sp. SO9 | 2665646 | GCA_016406125.1 | 113 |
| | Bacillus altitudinis 11-1-1 | 293387 | GCA_013283915.1 | 100 |
| | Bacillus amyloliquefaciens MOH1-5b | 1039 | GCA_014792065.1 | 102 |
| | Bacillus amyloliquefaciens KHG19 | 1292358 | GCA_000835145.1 | 101 |
| | Dickeya chrysanthemi Ech1591 | 561229 | GCA_000023565.1 | 108 |
| | Dickeya dianthicola ME23 | 1940567 | GCA_003403135.1 | 116 |
| | Enterococcus faecium isolate 2014-VREF-268 | 1352 | GCA_002025045.1 | 104 |
| | Enterococcus casseliflavus EC291 | 37734 | GCA_009707345.1 | 145 |
| | Clostridium saccharoperbutylacetonicum N1-504 | 36745 | GCA_002003305.1 | 214 |
| | Clostridium beijerinckii NCIMB 14988 | 1520 | GCA_000833105.2 | 193 |
| | Ruminiclostridium cellulolyticum H10 | 394503 | GCA_000022065.1 | 144 |
| | Streptomyces bingchenggensis BCW-1 | 749414 | GCA_000092385.1 | 387 |
| | Streptomyces sporoclivatus NBRC 100767 | 284038 | GCA_009936315.1 | 361 |
| | Schleiferilactobacillus harbinensis NSMJ42 | 304207 | GCA_008694105.1 | 153 |
| | Streptacidiphilus sp. P02-A3a | 2704468 | GCA_014084105.1 | 288 |
| | Streptosporangium roseum DSM 43021 | 479432 | GCA_000024865.1 | 254 |
| | Nocardia arthritidis AUSMDU00012717 | 228602 | GCA_011801145.1 | 189 |
| | Mycobacterium sp. JS623 | 212767 | GCA_000328565.1 | 136 |
| Gram Negative Bacteria | Actinobacillus equuli NCTC9435 | 718 | GCA_900638075.1 | 117 |
| | Azospirillum brasilense Sp 7 | 192 | GCA_001315015.1 | 161 |
| | Caulobacter segnis ATCC 21756 | 509190 | GCA_000092285.1 | 116 |
| | Cellvibrio japonicus Ueda107 | 498211 | GCA_000019225.1 | 222 |
| | Enterobacter asburiae CAV1043 | 61645 | GCA_003940765.1 | 205 |
| | Escherichia coli 142 | 562 | GCA_005221905.1 | 100 |
| | Escherichia coli 144 | 562 | GCA_005221585.1 | 282 |
| | Klebsiella aerogenes 035 | 548 | GCA_011604725.1 | 110 |
| | Klebsiella michiganensis BD177 | 1134687 | GCA_010093005.1 | 157 |
| | Klebsiella oxytoca KONIH4 | 571 | GCA_002906395.1 | 162 |
| | Pseudobacter ginsenosidimutans Gsoil 221 | 661488 | GCA_007970185.1 | 233 |
| | Pseudomonas cerasi | 1583341 | GCA_900074915.1 | 101 |
| | Salmonella enterica subsp. arizonae NCTC10047 | 59203 | GCA_900635675.1 | 146 |
| | Serratia marcescens 11/2010 | 615 | GCA_013426155.1 | 105 |
| | Serratia marcescens SM39 | 1334564 | GCA_000828775.1 | 101 |
| | Verrucomicrobia bacterium HZ-65 | 2026799 | GCA_002310495.1 | 282 |
| | Verrucomicrobia bacterium IMCC26134 | 1637999 | GCA_000972765.1 | 181 |
| | Xanthomonas citri subsp. citri A306 | 1308541 | GCA_000816885.1 | 171 |
| | Xanthomonas citri subsp. citri Aw12879 | 1137651 | GCA_000349225.1 | 170 |
The assignment of one or more CAZy family annotations to a protein by a CAZyme classifier identifies the protein as a CAZyme. If no CAZy family annotations are assigned to a protein, the protein is identified as a non-CAZyme. This section of the notebook evaluates the performance of the CAZyme classifiers dbCAN (which incorporates HMMER, Hotpep and DIAMOND), CUPP and eCAMI for this binary CAZyme/non-CAZyme classification.
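A minimal sketch of this labelling rule (the data structures are illustrative, not pyrewton's):

```python
# Toy predictions: protein accession -> list of predicted CAZy family annotations
predictions = {
    "protein_1": ["GH5", "CBM3"],  # at least one family annotation -> CAZyme
    "protein_2": [],               # no family annotations -> non-CAZyme
}

binary_calls = {name: int(len(families) > 0) for name, families in predictions.items()}
print(binary_calls)  # {'protein_1': 1, 'protein_2': 0}
```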
For each classifier and each test set, the specificity, sensitivity (recall), precision, F1-score and accuracy were calculated. The mean of each statistical parameter was calculated for each classifier across all test sets, to represent the overall performance of each CAZyme classifier. These results are presented in table 3.1; a minimal sketch of how these statistics can be computed follows the table. The performance of the classifiers for each statistical parameter is discussed in separate sections below.
| Prediction Tool | Mean Specificity | Specificity Standard Deviation | Mean Recall | Recall Standard Deviation | Mean Precision | Precision Standard Deviation | Mean F1-score | F1-score Standard Deviation | Mean Accuracy | Accuracy Standard Deviation |
|---|---|---|---|---|---|---|---|---|---|---|
| dbCAN | 0.9820 | 0.0471 | 0.8979 | 0.1210 | 0.9824 | 0.0436 | 0.9323 | 0.0915 | 0.9399 | 0.0654 |
| dbCAN-HMMER | 0.9836 | 0.0448 | 0.8777 | 0.0849 | 0.9833 | 0.0433 | 0.9245 | 0.0673 | 0.9306 | 0.0504 |
| dbCAN-Hotpep | 0.9766 | 0.0497 | 0.8174 | 0.1312 | 0.9752 | 0.0481 | 0.8823 | 0.0919 | 0.8970 | 0.0679 |
| dbCAN-DIAMOND | 0.9833 | 0.0389 | 0.8857 | 0.1576 | 0.9829 | 0.0396 | 0.9215 | 0.1246 | 0.9345 | 0.0816 |
| CUPP | 0.9820 | 0.0479 | 0.8541 | 0.0806 | 0.9821 | 0.0438 | 0.9108 | 0.0531 | 0.9181 | 0.0447 |
| eCAMI | 0.9766 | 0.0489 | 0.8580 | 0.1346 | 0.9766 | 0.0455 | 0.9062 | 0.0910 | 0.9173 | 0.0680 |
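The per-test-set statistics can be computed with standard sklearn calls; the sketch below uses toy labels rather than the evaluation data (sklearn provides no one-call specificity metric, so it is derived from the confusion matrix):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels for one 200-protein test set: 100 CAZymes (1) and 100 non-CAZymes (0)
y_true = [1] * 100 + [0] * 100
# Hypothetical predictions: 10 false negatives and 2 false positives
y_pred = [1] * 90 + [0] * 10 + [0] * 98 + [1] * 2

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
stats = {
    "specificity": tn / (tn + fp),  # derived, as sklearn has no specificity helper
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1_score": f1_score(y_true, y_pred),
    "accuracy": accuracy_score(y_true, y_pred),
}
```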
Specificity is the proportion of known negatives (in this case, known non-CAZymes) that are correctly classified as negatives/non-CAZymes. Figure 3.1 is a graphical representation of the specificity results in table 3.1.
Figure 3.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of specificities across all test sets.
All tools showed a low probability of misclassifying non-CAZymes as CAZymes, suggesting that CAZyme predictions by these tools should be treated as confident predictions. The weakest tools in this category were the k-mer methods Hotpep and eCAMI. The third k-mer method, CUPP, showed a similar performance to dbCAN.
Sensitivity (recall) is the proportion of known CAZymes that are correctly identified as CAZymes. Figure 3.2 shows the sensitivity for each test set for each classifier.
Figure 3.2: One-dimensional scatter plot of sensitivity (recall) of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of sensitivities across all test sets.
DIAMOND and dbCAN demonstrated the strongest performances, with the highest mean sensitivities and the highest quartile values. Hotpep showed the weakest performance, with the lowest mean and the greatest interquartile range, indicating poor consistency in performance.
The mean sensitivity across all test sets for eCAMI (0.8580 to 4 d.p.) was greater than that of CUPP (0.8541 to 4 d.p.). However, the standard deviation for eCAMI was greater than that of CUPP, as was the interquartile range. Therefore, eCAMI potentially has a higher probability of correctly identifying a known CAZyme as a CAZyme, but its performance is less consistent than CUPP's.
The sensitivities of all the classifiers indicate that they are unlikely to identify the complete CAZome of a candidate species, although they will identify the majority of CAZymes within the CAZome. dbCAN and DIAMOND will identify at least 90% of CAZymes within the CAZome; eCAMI and Hotpep will tend to identify 80-90% of a species' CAZome.
Precision is the proportion of positive predictions by a classifier that are correct. In this case, precision represents the fraction of CAZyme predictions that are correct, specifically the proportion of predicted CAZymes that are known CAZymes. Figure 3.3 depicts the precision of each classifier for each test set (table 3.1).
Figure 3.3: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying a box plot of interquartile ranges to represent the distribution of precisions across all test sets.
All tools demonstrated that the vast majority of CAZyme (positive) predictions are correct, generating few false positives. This suggests high confidence can be assigned to CAZyme (positive) predictions generated by the CAZyme classifiers; however, taking recall into consideration, the classifiers will not identify all CAZymes within a CAZome.
Again, all tools demonstrated a similar strength of performance, except the k-mer based methods Hotpep and eCAMI. Based upon the standard deviation of precision scores across all test sets, Hotpep and eCAMI are highly likely to generate a larger proportion of false positives than the other CAZyme classifiers evaluated, with approximately 3-5% of CAZyme predictions from Hotpep and eCAMI being false positives.
The F1-score is the harmonic mean of recall and precision and provides an indication of the overall performance of a tool, 0 being the lowest and 1 the best performance. Figure 3.4 shows the F1-score from each test set, for each classifier.
Figure 3.4: One-dimensional scatter plot of the F1-score of CAZyme and non-CAZyme predictions per test set, overlaying boxplot of interquartile ranges to represent the distribution of F1-scores across all test sets.
dbCAN and DIAMOND had the highest quartile values, but HMMER produced a higher mean F1-score and a smaller interquartile range than DIAMOND. Therefore, dbCAN, HMMER and DIAMOND demonstrated the strongest performances.
Hotpep showed the weakest performance, with the lowest mean F1-score and the greatest interquartile range, indicating poor consistency in performance.
CUPP demonstrated a stronger performance than eCAMI, with a higher mean F1-score, a smaller standard deviation and a smaller interquartile range, indicating that in general CUPP will produce a higher F1-score and perform more consistently than eCAMI.
Accuracy (calculated as (TP + TN) / (TP + TN + FP + FN)) provides an indication of the overall performance of the classifiers, measuring the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 3.5 plots the respective data from table 3.1.
Figure 3.5: One-dimensional scatter plot of accuracies of CAZyme and non-CAZyme predictions per test set, overlaying a boxplot of interquartile ranges to represent the distribution of accuracies across all test sets.
Similar to the F1-scores, dbCAN and DIAMOND showed the best performance. Arguably, Hotpep demonstrated the worst performance, although it was similar to that of the other k-mer methods, CUPP and eCAMI. This suggests that, alone, the k-mer methods are not as effective at differentiating between CAZymes and non-CAZymes as methods that rely on more global sequence similarity, such as HMMER and DIAMOND.
The statistics evaluated above provide an idea of the general performance of the tools, but they do not provide an idea of the expected range of performance. Specifically, the data do not provide a clear picture of the best and worst performance a user can expect when using these tools.
To compare the expected typical range in accuracies for each classifier, 6 test sets (identified by their source genomic assemblies) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times, and the accuracy of each bootstrap sample was calculated. The accuracies of the bootstrap samples for each classifier were plotted as stacked histograms, shown in figure 3.6.
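A minimal sketch of this bootstrap procedure, assuming binary label and prediction arrays for one classifier on one test set (names are illustrative, not pyrewton's API):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_accuracies(y_true, y_pred, n_samples=100, seed=42):
    """Resample the label/prediction pairs with replacement and
    return the accuracy of each bootstrap sample."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracies = []
    for _ in range(n_samples):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # sample with replacement
        accuracies.append(accuracy_score(y_true[idx], y_pred[idx]))
    return accuracies
```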
Figure 3.6: Stacked histograms of bootstrap sample accuracies of CAZyme classifiers’ differentiation between CAZymes and non-CAZymes. 6 test sets (identified by their source genomic assembly) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times. The accuracy of each of the 600 bootstrap samples per test set were plotted as a stacked histogram.
Few of the known non-CAZymes were classified as CAZymes by the CAZyme classifiers. Non-CAZymes may have been classified as CAZymes because:
- of very high sequence similarity between the non-CAZyme and known CAZymes
- CAZy has incorrectly classified the protein as a non-CAZyme
- exclusion of a protein from CAZy is not a strong enough criterion to definitively define the protein as a non-CAZyme
The latter two points may be true if all 6 classifiers classify the non-CAZyme as a CAZyme.
To explore the first point, the BLAST Score Ratios of all false positive CAZyme predictions were plotted on a boxplot.
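For context, the BLAST Score Ratio (BSR) normalises the bit score of a query-subject alignment by the bit score of the query aligned against itself; a minimal sketch follows, assuming the calculation in this notebook follows this standard definition:

```python
def blast_score_ratio(query_vs_subject_bitscore: float,
                      query_vs_self_bitscore: float) -> float:
    """BSR: bit score of the query aligned to a subject, normalised by the
    bit score of the query aligned to itself (0 = no similarity, ~1 = identical)."""
    return query_vs_subject_bitscore / query_vs_self_bitscore

# Example: a non-CAZyme whose best hit against a CAZy-listed CAZyme scores
# 150 bits, and whose self-alignment scores 400 bits
print(blast_score_ratio(150.0, 400.0))  # 0.375
```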
Figure 3.7: BLAST Score Ratio of all non-CAZymes falsely classified as CAZymes by at least 5 of the CAZyme prediction tools dbCAN, HMMER, Hotpep, DIAMOND, CUPP and eCAMI.
Figure 3.7 demonstrates that there is no correlation between a protein not being classified in CAZy and it being classified as a CAZyme by the CAZyme prediction tools.
This leaves the latter two potential causes of proteins not included in CAZy being classified as CAZymes by the prediction tools:
- CAZy has incorrectly classified the protein as a non-CAZyme
- exclusion of a protein from CAZy is not a strong enough criterion to definitively define the protein as a non-CAZyme
These two points both allude to the idea that, although CAZy may be the most comprehensive CAZyme database, it is not exhaustive. This is very likely, owing to our sequencing capacity far exceeding our capacity to accurately annotate protein function; therefore, it is very likely there are CAZymes that have not yet been analysed by CAZy. Consequently, exclusion from CAZy should perhaps not be interpreted as definitive identification of a non-CAZyme.
…add in table to potential of ‘non-CAZymes’ being CAZymes… … … …
The CAZyme prediction tools predict the CAZy family annotations of CAZymes. CAZy families are catalogued into one of six CAZy classes (definitions taken from www.cazy.org):
- Glycoside Hydrolases (GHs)
- GlycosylTransferases (GTs)
- Polysaccharide Lyases (PLs)
- Carbohydrate Esterases (CEs)
- Auxiliary Activities (AAs)
- Carbohydrate-Binding Modules (CBMs)
A prediction tool may be unable to accurately predict the specific CAZy family of a protein while still accurately predicting the correct CAZy class. This section of the notebook evaluates the ability of each prediction tool to predict the correct CAZy class, irrespective of whether the child CAZy family prediction is correct. No previous evaluations of the CAZyme prediction tools have evaluated their performance at the level of CAZy class prediction.
A single CAZyme can belong to multiple CAZy classes, making this a multilabel classification problem. To address this and evaluate the multilabel classification of CAZy classes, the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.
The RI is a measure of agreement across all potential classifications of a protein, and ranges from 0 (no correct annotations) to 1 (all annotations correct) (figure 4.1). The ARI is the RI adjusted for chance: 0 is equivalent to assigning the CAZy class annotations randomly, -1 indicates the annotations are systematically assigned incorrectly, and 1 indicates the annotations are all correct (figure 4.2).
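A minimal sketch of the RI and ARI calculations with sklearn's clustering metrics, assuming the known and predicted annotations have been flattened into one label per protein-annotation pair (the exact encoding used by pyrewton may differ):

```python
from sklearn.metrics import adjusted_rand_score, rand_score

# Toy encoding: one CAZy class label per protein-annotation pair
known_classes     = ["GH", "GH", "GT", "PL", "CE", "AA"]
predicted_classes = ["GH", "GT", "GT", "PL", "CE", "CBM"]

ri = rand_score(known_classes, predicted_classes)            # 0 to 1
ari = adjusted_rand_score(known_classes, predicted_classes)  # chance-corrected; 1 = perfect
```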
Figure 4.1: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
Figure 4.2: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
The performance of each classifier was also evaluated for each CAZy class, independent of its performance for the other CAZy classes. True negative non-CAZyme predictions were excluded.
Figure 4.3: Proportional area plot shaded to represent the distribution of Fbeta-score for each CAZy class for each test set parsed by CAZyme prediction tools.
A single CAZyme can belong to multiple CAZy families, from multiple different CAZy classes, making this a multilabel classification problem. To address this and evaluate the multilabel classification of CAZy families, the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.
The RI is a measure of agreement across all potential classifications of a protein, and ranges from 0 (no correct annotations) to 1 (all annotations correct) (figure 5.1). The ARI is the RI adjusted for chance: 0 is equivalent to assigning the CAZy family annotations randomly, -1 indicates the annotations are systematically assigned incorrectly, and 1 indicates the annotations are all correct (figure 5.2).
Figure 5.1: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.
Figure 5.2: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.
The performance of each classifier was also evaluated for each CAZy family, independent of its performance for the other CAZy families. True negative non-CAZyme predictions were excluded.
Figure 5.3: Proportional area plot shaded to represent the distribution of Fbeta-score for each CAZy family for each test set parsed by CAZyme prediction tools.
To evaluate the performance of each prediction tool for each CAZy family, the families were grouped by their CAZy class and, for each prediction tool, specificity was plotted against sensitivity for each CAZy family. To find families for which most tools showed a poor performance (defined as an Fbeta-score of less than 0.75), heatmaps were plotted of the Fbeta-scores of the CAZy families for which at least 3 tools produced an Fbeta-score of less than 0.75, also comparing the number of CAZyme records in each family in CAZy and the number of family members included across all test sets.
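A minimal sketch of a per-family Fbeta-score calculation with sklearn; the labels below are toy values, and the beta weighting used in this evaluation is set in pyrewton's configuration rather than stated here, so beta=0.5 is illustrative only:

```python
from sklearn.metrics import fbeta_score

# Toy per-family membership labels: 1 = the protein belongs to this CAZy family
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # one false negative and one false positive

# beta < 1 weights precision more heavily; beta > 1 weights recall more heavily
score = fbeta_score(y_true, y_pred, beta=0.5)
```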
Figure 5.4 shows the specificity and sensitivity of each prediction tool for each Glycoside Hydrolase family. All prediction tools showed a very strong performance for specificity, with no tool producing a specificity score less than 0.995.
dbCAN had the most families with a sensitivity greater than or equal to 0.9 (99 families), closely followed by HMMER and CUPP (with 97 families each). However, dbCAN and HMMER had more families with a sensitivity greater than or equal to 0.75 than CUPP (114, 113 and 103 families respectively). Therefore, dbCAN and HMMER showed the strongest performances for GH families.
dbCAN-Hotpep, CUPP and eCAMI showed the weakest performances, with the most families producing a sensitivity score less than 0.75. However, eCAMI and dbCAN-Hotpep had the most families with a specificity score less than 0.995, although the specificity scores from Hotpep were lower than those from eCAMI. Therefore, dbCAN-Hotpep showed the weakest performance, although overall the performances of the tools were similar. dbCAN demonstrated the strongest performance, with the most families with a sensitivity greater than 0.75 and a specificity of 1.
Figure 5.4: Scatter plot of specificity against sensitivity for each CAZy family within the Glycoside Hydrolase class. Hover cursor over each point to see the specific sensitivity and specificity.
To identify GH families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.5. It was no surprise that GH0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as GHs but for which the CAZy family cannot be determined; therefore, GH0 includes members from multiple different CAZy families. Thus, GH0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for GH0 is typically lower than for other CAZy families for all prediction tools. Families GH163-GH170 are not included in the models within the prediction tools (except HMMER, which does include GH163); therefore, these tools cannot predict members of these families.
The remaining families contained very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fbeta-score are significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score.
Figure 5.5: Heatmap of Glycoside Hydrolases families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.
Figure 5.6 shows the specificity against sensitivity from each CAZyme prediction tool for each GT family. All tools showed an extremely strong performance for specificity; no tool produced a specificity of less than 0.9985.
The k-mer based methods, dbCAN-Hotpep, CUPP and eCAMI showed the weakest performances because they had the most families with a sensitivity of less than 0.75.
HMMER had the most families with a sensitivity equal to or greater than 0.9 (51 GT families); however, DIAMOND had the most families with a sensitivity equal to or greater than 0.75, and had the fewest families with a sensitivity less than 0.75 (58 and 11 GT families respectively). Therefore, HMMER and DIAMOND both showed the strongest performances.
dbCAN had the most families with a specificity greater than 0.99975, but the difference in specificity scores was so small that it was not possible to differentiate the performance of the prediction tools by specificity. However, DIAMOND had the most families with a sensitivity greater than 0.75, and thus showed the strongest performance for sensitivity of all the prediction tools for predicting GT families.
Figure 5.6: Scatter plot of specificity against sensitivity for each CAZy family within the GlycosylTransferases class. Hover cursor over each point to see the specific sensitivity and specificity.
To identify GT families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.7. It was no surprise that GT0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as GTs but for which the CAZy family cannot be determined; therefore, GT0 includes members from multiple different CAZy families. Thus, GT0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for GT0 is typically lower than for other CAZy families for all prediction tools.
Families GT109-GT113 are not included in the models within the prediction tools; therefore, these tools cannot predict members of these families.
The remaining families (except GT29 and GT31) contained very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fbeta-score are significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score.
GT29 and GT31 had 30 and 20 family members, respectively, included across all test sets. These sample sizes are suitable for producing an accurate representation of the prediction tools' performance for these families. A potential reason for the poor performance of most prediction tools for these families is that GT29 and GT31 contain greater sequence diversity than families for which most prediction tools performed well (Fbeta-score greater than 0.75). Greater sequence diversity within a family makes accurately modeling the family, and thus predicting family members, more difficult.
Figure 5.7: Heatmap of GlycosylTransferases families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.
Figure 5.8 plots the specificity against sensitivity from each CAZyme prediction tool for each PL family. All prediction tools showed a very strong specificity performance, with no family scoring less than 0.9997. The differences between specificity scores were so small that the performances of the prediction tools could not be differentiated by specificity.
The spread (variation) in sensitivity scores for PL families was greater than that for GH and GT families. HMMER showed the strongest performance, with the most families (16 PL families) achieving a sensitivity score greater than 0.9. Hotpep and eCAMI had the most families with a sensitivity of less than 0.75 (9 families each); however, eCAMI had fewer families with a sensitivity of greater than or equal to 0.9 than Hotpep (6 and 10 families, respectively); therefore, eCAMI showed the weakest performance.
Figure 5.8: Scatter plot of specificity against sensitivity for each CAZy family within the Polysaccharide Lyases class.
To identify PL families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.9. It was no surprise that PL0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as PLs but for which the CAZy family cannot be determined; therefore, PL0 includes members from multiple different CAZy families. Thus, PL0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for PL0 is typically lower than for other CAZy families for all prediction tools.
CUPP and eCAMI do not include any PL families newer than PL28; therefore, they could not predict members of PL31, PL33 and PL38. dbCAN and its incorporated tools HMMER, Hotpep and DIAMOND are based upon a more recent version of CAZy and do include PL31 and PL33, but they do not include PL38.
The remaining families contained very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fbeta-score are significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score. 5 members of PL17 were included across the test sets, and three prediction tools produced an Fbeta-score greater than 0.88, suggesting the tools may perform well against PL17 family members but the limited sample favours producing a lower Fbeta-score.
Figure 5.9: Heatmap of Polysaccharide Lyases families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.
Figure 5.10 plots the specificity against sensitivity from each CAZyme prediction tool for each CE family.
All prediction tools showed a very strong specificity performance, with no family scoring less than 0.998. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.
dbCAN, HMMER and Hotpep showed the strongest performances, with the most families scoring a sensitivity equal to or greater than 0.75. dbCAN and HMMER showed a slightly stronger performance than Hotpep, with more families with a sensitivity equal to or greater than 0.9. DIAMOND, despite previously showing one of the strongest performances, showed the weakest performance, with the most families with a sensitivity less than 0.75. However, eCAMI had the fewest families with a sensitivity equal to or greater than 0.9.
Figure 5.10: Scatter plot of specificity against sensitivity for each CAZy family within the Carbohydrate Esterases class.
To identify CE families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.11. It was no surprise that CE0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as CEs but for which the CAZy family cannot be determined; therefore, CE0 includes members from multiple different CAZy families. Thus, CE0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for CE0 is typically lower than for other CAZy families for all prediction tools.
None of the prediction tools include the CAZy family CE18; therefore, none of the prediction tools could predict any of the proteins belonging to CE18, resulting in poor performances.
CE16 is included in all the prediction tools, but only three family members were included across all test sets. A sample size this small significantly increases the probability of producing a low Fbeta-score. Therefore, the prediction tools are unlikely to truly perform poorly for CE16; the small sample size was more influential in producing a low Fbeta-score.
Figure 5.11: Heatmap of Carbohydrate Esterase families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.
Figure 5.12 plots the specificity against sensitivity from each CAZyme prediction tool for each AA family.
All prediction tools showed a very strong specificity performance, with no family scoring less than 0.997. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.
HMMER had the most families with a sensitivity score equal to or greater than 0.9, and thus showed the strongest performance. DIAMOND, CUPP and eCAMI showed the weakest performance with the most families scoring a sensitivity less than 0.75.
CUPP and dbCAN demonstrated similarly strong performances, with both tools having 8 AA families with a sensitivity equal to or greater than 0.9. However, dbCAN had more families with a sensitivity equal to or greater than 0.75 than CUPP; therefore, dbCAN showed a slightly stronger performance than CUPP.
Figure 5.12: Scatter plot of specificity against sensitivity for each CAZy family within the Auxiliary Activities class.
To identify AA families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.13. It was no surprise that AA0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as AAs but for which the CAZy family cannot be determined; therefore, AA0 includes members from multiple different CAZy families. Thus, AA0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for AA0 is typically lower than for other CAZy families for all prediction tools.
The remaining families contained very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fbeta-score are significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score.
Figure 5.13: Heatmap of Auxiliary Activities families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.
Figure 5.14 plots the specificity against sensitivity from each CAZyme prediction tool for each CBM family.
CUPP predicted no members of any CBM family. CUPP was invoked three times, and the output files were searched for predictions of CBM families, but none were found. Therefore, CUPP showed the weakest performance for CBM families.
dbCAN and DIAMOND had the most families with a sensitivity greater than or equal to 0.9 (35 families each), and had similar numbers of families with a sensitivity greater than 0.75 (44 and 43 families respectively). Therefore, both dbCAN and DIAMOND showed the strongest performance.
Hotpep showed a slightly stronger performance than eCAMI, with more families with a sensitivity greater than or equal to 0.9 and 0.75.
All prediction tools (except CUPP) showed a very strong specificity performance, with no family scoring less than 0.98. The differences between specificity scores were so small the performances of the prediction tools could not be differentiated by specificity.
Figure 5.14: Scatter plot of specificity against sensitivity for each CAZy family within the Carbohydrate-Binding Modules class.
To identify CBM families for which most prediction tools performed poorly, families for which at least three prediction tools produced an Fbeta-score of less than 0.75 were identified, as shown in figure 5.15. It was no surprise that CBM0 was included, because CAZy classifies this family as 'unclassified'. The family includes CAZymes that CAZy has classified as CBMs but for which the CAZy family cannot be determined; therefore, CBM0 includes members from multiple different CAZy families. Thus, CBM0 has a higher sequence diversity, making accurate modeling of this family more difficult. Consequently, the performance for CBM0 is typically lower than for other CAZy families for all prediction tools.
The remaining families contained very small sample sizes of fewer than 10 proteins. Thus, the odds of producing a low Fbeta-score are significantly increased, and the probability of producing a low Fbeta-score is much greater than that of producing a high Fbeta-score.
…
Figure 5.15: Heatmap of Carbohydrate Binding Module families for which at least three CAZyme prediction tools produced a poor performance, defined as a Fbeta-score less than 0.75. ‘Family population’ is the number of CAZyme records in each family in CAZy, and ‘Sample size’ is the number of proteins from the CAZy family included across all test sets.